Preprocessing Yelp Image and Text data for NLP/CV modelling¶

Last updated : October 16th, 2022

Introduction¶

During this project, I will preprocess two datasets from Yelp : a 7 GB reviews dataset and a 9 GB photos dataset. The preprocessed data will then be fed to NLP and CV models respectively. This notebook covers only the preprocessing part of the project.

To check the viability of our preprocessing pipeline in production, I will also implement a Yelp API querying algorithm to retrieve new data.

In [2]:
#Importing packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
#Setting large figure size for Seaborn
sns.set(rc={'figure.figsize':(11.7,8.27),"font.size":20,"axes.titlesize":20,"axes.labelsize":18})
#Importing Intel extension for sklearn to improve speed
# from sklearnex import patch_sklearn, unpatch_sklearn
# patch_sklearn()
#import cudf
import dill

1. Text data preprocessing¶

1.1 Data loading and filtering¶

First of all, we will load the business and reviews datasets. The other three datasets have been examined but are of no interest for this project.

Below are the first few lines of the business dataset :

In [12]:
#64GB of RAM so no need to compress data
business = pd.read_json("Data/yelp_academic_dataset_business.json", lines=True)
# checkins = pd.read_json("Data/yelp_academic_dataset_checkin.json", lines=True)
reviews = pd.read_json("Data/yelp_academic_dataset_review.json", lines=True)
# tips = pd.read_json("Data/yelp_academic_dataset_tip.json", lines=True)
# users = pd.read_json("Data/yelp_academic_dataset_user.json", lines=True)

business.head()
Out[12]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
0 Pns2l4eNsfO8kk83dixA6A Abby Rappoport, LAC, CMQ 1616 Chapala St, Ste 2 Santa Barbara CA 93101 34.426679 -119.711197 5.0 7 0 {'ByAppointmentOnly': 'True'} Doctors, Traditional Chinese Medicine, Naturop... None
1 mpf3x-BjTdTEA3yCZrAYPw The UPS Store 87 Grasso Plaza Shopping Center Affton MO 63123 38.551126 -90.335695 3.0 15 1 {'BusinessAcceptsCreditCards': 'True'} Shipping Centers, Local Services, Notaries, Ma... {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...
2 tUFrWirKiKi_TAnsVWINQQ Target 5255 E Broadway Blvd Tucson AZ 85711 32.223236 -110.880452 3.5 22 0 {'BikeParking': 'True', 'BusinessAcceptsCredit... Department Stores, Shopping, Fashion, Home & G... {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...
3 MTSW4McQd7CbVtyjqoe9mw St Honore Pastries 935 Race St Philadelphia PA 19107 39.955505 -75.155564 4.0 80 1 {'RestaurantsDelivery': 'False', 'OutdoorSeati... Restaurants, Food, Bubble Tea, Coffee & Tea, B... {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...
4 mWMc6_wTdE0EUBKIGXDVfA Perkiomen Valley Brewery 101 Walnut St Green Lane PA 18054 40.338183 -75.471659 4.5 13 1 {'BusinessAcceptsCreditCards': 'True', 'Wheelc... Brewpubs, Breweries, Food {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2...

Here are the first few rows of the reviews dataset :

In [3]:
reviews.head()
Out[3]:
review_id user_id business_id stars useful funny cool text date
0 KU_O5udG6zpxOg-VcAEodg mh_-eMZ6K5RLWhZyISBhwA XQfwVwDr-v0ZS3_CbbE5Xw 3 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11
1 BiTunyQ73aT9WBnpR9DZGw OyoGAe7OKpv6SyGZT5g77Q 7ATYjTIgM3jUlt4UM3IypQ 5 1 0 1 I've taken a lot of spin classes over the year... 2012-01-03 15:28:18
2 saUsX_uimxRlCVr67Z4Jig 8g_iMtfSiwikVnbP2etR0A YjUWPpI6HXG530lwP-fb2A 3 0 0 0 Family diner. Had the buffet. Eclectic assortm... 2014-02-05 20:30:30
3 AqPFMleE6RsU23_auESxiA _7bHUi9Uuf5__HHc_Q8guQ kxX2SOes4o-D3ZQBkiMRfA 5 1 0 1 Wow! Yummy, different, delicious. Our favo... 2015-01-04 00:01:03
4 Sx8TMOWLNuJBWer-0pcmoA bcjbaE6dDog4jkNY91ncLQ e4Vwtrqf-wpJfwesgvdgxQ 4 1 0 1 Cute interior and owner (?) gave us tour of up... 2017-01-14 20:54:15

Since we are working for a restaurant company, we are only interested in businesses that are restaurants. We will create a list of restaurant businesses and merge it with our reviews dataframe to filter out reviews from other kinds of businesses.

In [13]:
import re

def restaurant_select(x):
    #Flag businesses whose categories mention "Restaurant"
    return 1 if re.search(r'Restaurant', str(x)) else 0
    


business["is_restaurant"] = business["categories"].apply(restaurant_select)

#Keeping only restaurants
business = business[business.is_restaurant == 1]

business = business[["business_id"]]

business
Out[13]:
business_id
3 MTSW4McQd7CbVtyjqoe9mw
5 CF33F8-E6oudUQ46HnavjQ
8 k0hlBqXX-Bt0vf1op7Jr1w
9 bBDDEgkFA1Otx9Lfe7BZUQ
11 eEOYSgkmpB90uNA7lDOMRA
... ...
150325 l9eLGG9ZKpLJzboZq-9LRQ
150327 cM6V90ExQD6KMSU3rRB5ZA
150336 WnT9NIzQgLlILjPT0kEcsQ
150339 2O2K6SXPWv56amqxCECd4w
150340 hn9Toz3s-Ei3uZPt7esExA

52286 rows × 1 columns

In [14]:
#Keeping only reviews on restaurants
reviews = pd.merge(reviews, business, on="business_id", how="inner")

#Dropping user_id and business_id
reviews.drop(columns={"user_id", "business_id"}, inplace=True)
reviews.set_index("review_id", inplace=True)

print(reviews.shape)
reviews.head()
(4724684, 6)
Out[14]:
stars useful funny cool text date
review_id
KU_O5udG6zpxOg-VcAEodg 3 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11
VJxlBnJmCDIy8DFG0kjSow 2 0 0 0 This is the second time we tried turning point... 2017-05-13 17:06:55
S6pQZQocMB1WHMjTRbt77A 4 2 0 1 The place is cute and the staff was very frien... 2017-08-08 00:58:18
WqgTKVqWVHDHjnjEsBvUgg 3 0 0 0 We came on a Saturday morning after waiting a ... 2017-11-19 02:20:23
M0wzFFb7pefOPcxeRVbLag 2 0 0 0 Mediocre at best. The decor is very nice, and ... 2017-09-09 17:49:47

We will now compute the length of each comment and look at the distribution of this variable.

In [15]:
reviews["text_length"] = reviews["text"].apply(len)

reviews.head()
Out[15]:
stars useful funny cool text date text_length
review_id
KU_O5udG6zpxOg-VcAEodg 3 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11 513
VJxlBnJmCDIy8DFG0kjSow 2 0 0 0 This is the second time we tried turning point... 2017-05-13 17:06:55 477
S6pQZQocMB1WHMjTRbt77A 4 2 0 1 The place is cute and the staff was very frien... 2017-08-08 00:58:18 216
WqgTKVqWVHDHjnjEsBvUgg 3 0 0 0 We came on a Saturday morning after waiting a ... 2017-11-19 02:20:23 736
M0wzFFb7pefOPcxeRVbLag 2 0 0 0 Mediocre at best. The decor is very nice, and ... 2017-09-09 17:49:47 953
In [17]:
plt.hist(reviews["text_length"])
plt.title("Histogram of the length of comments")
plt.xlabel("Length (characters)")
plt.ylabel("Number of comments")
plt.show()

#looking at the distribution of reviews with less than 1000 characters
plt.hist(reviews[reviews.text_length < 1000]["text_length"])
plt.title("Histogram of the length of comments with less than 1000 characters")
plt.xlabel("Length (characters)")
plt.ylabel("Number of comments")
plt.show()

We can see that most comments are approximately 200 characters long.
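Note that `text_length` counts characters, not words. If we wanted an approximate word count instead, we could split on whitespace. A minimal sketch on a toy frame (the two sample reviews are invented), using the same pandas string accessors:

```python
import pandas as pd

# Toy stand-in for the reviews dataframe; same "text" column
toy = pd.DataFrame({"text": ["Great food, friendly staff.", "Meh."]})

# Character count -- what text_length measures above
toy["n_chars"] = toy["text"].str.len()
# Approximate word count -- split on whitespace and count tokens
toy["n_words"] = toy["text"].str.split().str.len()

print(toy[["n_chars", "n_words"]])
```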

In [8]:
#Looking at reviews with less than 50 characters
reviews[reviews.text_length < 50]
#Some still carry useful signal (e.g. "the gluten free pizza is unbeatable"), so we will keep these samples
#We will only drop reviews with 10 characters or fewer.

reviews = reviews[reviews.text_length>10]

reviews[reviews.text.isna()]
#No NA values

reviews.info(verbose=True, show_counts=True)

#reviews = cudf.from_pandas(reviews)

reviews.head()
<class 'pandas.core.frame.DataFrame'>
Index: 4724551 entries, KU_O5udG6zpxOg-VcAEodg to nGLcmo0D3IKrqqgK1kutlA
Data columns (total 7 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   stars        4724551 non-null  int64         
 1   useful       4724551 non-null  int64         
 2   funny        4724551 non-null  int64         
 3   cool         4724551 non-null  int64         
 4   text         4724551 non-null  object        
 5   date         4724551 non-null  datetime64[ns]
 6   text_length  4724551 non-null  int64         
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 288.4+ MB
Out[8]:
stars useful funny cool text date text_length
review_id
KU_O5udG6zpxOg-VcAEodg 3 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11 513
VJxlBnJmCDIy8DFG0kjSow 2 0 0 0 This is the second time we tried turning point... 2017-05-13 17:06:55 477
S6pQZQocMB1WHMjTRbt77A 4 2 0 1 The place is cute and the staff was very frien... 2017-08-08 00:58:18 216
WqgTKVqWVHDHjnjEsBvUgg 3 0 0 0 We came on a Saturday morning after waiting a ... 2017-11-19 02:20:23 736
M0wzFFb7pefOPcxeRVbLag 2 0 0 0 Mediocre at best. The decor is very nice, and ... 2017-09-09 17:49:47 953

After analyzing the length of comments, we will look at their polarity.

1.2 Sentiment Analysis¶

Since we are only interested in negative reviews when looking for the main topics of dissatisfaction, we will calculate the polarity of each comment and then keep only the negative ones.

In [19]:
#Using Vader to calculate the polarity of our reviews
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

example = reviews.iloc[3,:]

sentences = example.text.split('.')

#Comparing the polarity computed on the whole review with the mean of per-sentence polarities
print(analyzer.polarity_scores(example.text)["compound"])

scores = []
for s in sentences:
    scores.append(analyzer.polarity_scores(s)["compound"])
print(np.mean(scores))
#Significant difference !

#Looking at the actual review :
print(sentences)
#Clearly, the review is not that positive and it is even slightly negative

def analyze_polarity(x):
    sentences = str(x).split('.')
    scores = []
    for s in sentences:
        scores.append(analyzer.polarity_scores(s)["compound"])
    return np.mean(scores) 
0.8333
0.12201999999999999
['We came on a Saturday morning after waiting a few months after opening hoping that they would resolve the issues from a new restaurant opening', ' We were seated right away and the server brought water, coffee and took our orders right away', ' We waited over 30 mins for breakfast', " I got the freebird and came out first before my husband's dish", ' While it tastes good, it was just potatoes and the spicy sausage gravy was mostly a sauce', ' There was barely any sausage', ' My husband got the ny deli omelette that had way too much cheese that it overpowered everything and very little pastrami', " Lastly, we were ready to go and our server spent at least 10 mins chatting at another table so I couldn't get our check", " I'm not sure if we will return", '']
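One caveat on the splitting above: `str(x).split('.')` ignores sentences that end in '!' or '?'. A hedged refinement would split on all three terminators before scoring each fragment with `analyzer.polarity_scores` as before. This is a splitting sketch only; the scoring step is unchanged and the sample review is invented:

```python
import re

def split_sentences(text):
    # Split on runs of '.', '!' or '?' and drop empty fragments
    parts = re.split(r'[.!?]+', str(text))
    return [p.strip() for p in parts if p.strip()]

sample = "Great tacos! Terrible service, though. Would we return? Probably not."
print(split_sentences(sample))
```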
In [18]:
reviews["polarity"] = reviews["text"].apply(analyze_polarity)

reviews.head()
Out[18]:
stars useful funny cool text date text_length polarity
review_id
KU_O5udG6zpxOg-VcAEodg 3 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11 513 0.230957
VJxlBnJmCDIy8DFG0kjSow 2 0 0 0 This is the second time we tried turning point... 2017-05-13 17:06:55 477 -0.042017
S6pQZQocMB1WHMjTRbt77A 4 2 0 1 The place is cute and the staff was very frien... 2017-08-08 00:58:18 216 0.371014
WqgTKVqWVHDHjnjEsBvUgg 3 0 0 0 We came on a Saturday morning after waiting a ... 2017-11-19 02:20:23 736 0.122020
M0wzFFb7pefOPcxeRVbLag 2 0 0 0 Mediocre at best. The decor is very nice, and ... 2017-09-09 17:49:47 953 0.059770
In [26]:
#Removing date and review_id data that we will not exploit to reduce dataframe size

reviews.reset_index(inplace=True)
reviews.drop(columns=["review_id","date"], inplace=True)

#Saving our current reviews file
with open('Data/reviews.pkl', 'wb') as file:
    dill.dump(reviews, file)
In [20]:
with open('Data/reviews.pkl', 'rb') as file:
    reviews = dill.load(file)

Let's look at the distribution of polarity for our reviews :

In [21]:
plt.hist(reviews["polarity"])
plt.title("Histogram of the polarity of reviews")
plt.xlabel("Polarity of Reviews")
plt.ylabel("Number of Reviews")
plt.show()

It is clear that most reviews have a polarity of around 0. Let's now look at the distribution of the number of stars in each review:

In [22]:
plt.hist(reviews["stars"], bins=5)
plt.title("Histogram of the number of stars in Yelp Reviews")
plt.xlabel("Number of Stars")
plt.ylabel("Number of Reviews")
plt.show()

Since we are mainly interested in negative reviews, let's look at the number of stars of negative reviews (with negative polarity) :

In [23]:
#Looking at the star ratings over reviews with less than 0 polarity
plt.hist(reviews[reviews.polarity < 0]["stars"], bins=5)
plt.title("Number of stars of Negative Yelp Reviews")
plt.xlabel("Number of stars")
plt.ylabel("Number of reviews")
#This mostly validates our polarity scoring methodology
plt.show()

This confirms the relevance of our sentiment analysis since most reviews with negative polarity have between 1 and 2 stars.

Since we are only interested in negative reviews for this analysis, we will keep only the reviews with a negative polarity and at most 2 stars.

In [14]:
#Since we want to keep only topics of dissatisfaction, we will only keep reviews with 1-2 stars and a negative polarity
df = reviews.loc[(reviews.polarity < 0) & (reviews.stars <= 2)].copy()

df["text"] = df["text"].apply(lambda x: str(x).lower())

df.shape
Out[14]:
(486008, 8)

Before analyzing our reviews, we need to perform some basic operations to preprocess our text data.

1.3 Text Normalization¶

We will use spaCy to lemmatize and tokenize our dataset.

This will remove stop words and punctuation and shrink each review, making the reviews easier to analyze.

In [75]:
import spacy
spacy.prefer_gpu()

#The trf model is too slow for this dataset, so we use the sm model here; this would need to change for production
#nlp = spacy.load("en_core_web_trf")
nlp = spacy.load("en_core_web_sm")

def lemmatize(x):
    doc = nlp(x)
    tokens = [token.lemma_ for token in doc if not (token.is_stop or token.is_punct)]
    return ' '.join(tokens)

df["lemma_text"] = df["text"].apply(lemmatize)
In [4]:
#Saving our lemmatized reviews file
with open('Data/lemma_reviews.pkl', 'wb') as file:
    dill.dump(df, file)
In [2]:
with open('Data/lemma_reviews.pkl', 'rb') as file:
    df = dill.load(file)

We will begin by applying manual text vectorization and dimensionality reduction. In a later part, we will show how the BERTopic module can extract topics of interest.

2. Manual Text Vectorization and Dimensionality Reduction¶

First, we need to turn our features into vectors. We will use the TF-IDF Vectorizer which is quite fast.

2.1 Vectorizing our dataset¶

In [129]:
from sklearn.feature_extraction.text import TfidfVectorizer

X = df["lemma_text"]

model = TfidfVectorizer(lowercase=True, max_features=1000)
X_tr = model.fit_transform(X)
In [130]:
X_tr.shape
Out[130]:
(486008, 1000)

This embedding algorithm has reduced the size of our text data to 1000 features. Now we will perform dimensionality reduction to be able to visualize our text data.

2.2 Dimensionality reduction using UMAP¶

We reduce the dimensionality of our vectors a first time before applying HDBSCAN clustering, to speed up the clustering process. We will then further reduce the dimensionality to 2 features for better visualization.

In [131]:
import umap
X_reduced = umap.UMAP(n_neighbors=15,
                      n_components=10,
                      metric='cosine').fit_transform(X_tr)
In [133]:
print(X_reduced.shape)
X_reduced
(486008, 10)
Out[133]:
array([[ 9.560419 ,  9.0328455,  3.0807197, ...,  5.0644646,  4.073515 ,
         9.034298 ],
       [ 9.71617  ,  8.386079 ,  2.5083742, ...,  5.7708173,  5.4900146,
        10.477584 ],
       [ 9.954774 , 10.604382 ,  3.0913274, ...,  4.6353145,  3.0057309,
        10.113983 ],
       ...,
       [ 9.354248 ,  9.711932 ,  3.033503 , ...,  5.6380095,  4.4658704,
         8.070437 ],
       [10.355504 ,  9.899299 ,  2.5777116, ...,  4.508413 ,  4.393826 ,
         9.973897 ],
       [10.280046 ,  9.876802 ,  2.5860057, ...,  4.5620923,  4.2433076,
        10.015702 ]], dtype=float32)

Our first use of UMAP has reduced our number of features by a factor of 100. We can now look for clusters with HDBSCAN to identify topics of interest.

2.3 Clustering with HDBSCAN and UMAP visualization¶

Now that we've reduced the dimensionality to 10, we will apply HDBSCAN to find clusters (or topics) which we will then visualize by reducing the dimensionality further to 2.

In [139]:
import cuml
cluster = cuml.HDBSCAN(min_cluster_size=15,
                       metric='euclidean',
                       cluster_selection_method='eom',
                       verbose = True).fit(X_reduced)
print("Clustering completed")
#Visualization, reapplying UMAP
umap_viz = umap.UMAP(n_neighbors=15, n_components=2, verbose=True, metric='cosine').fit_transform(X_tr)
result = pd.DataFrame(umap_viz, columns=['x', 'y'])
result['labels'] = cluster.labels_

# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=0.05)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=0.05, cmap='hsv_r')
plt.colorbar()
Clustering completed
UMAP(angular_rp_forest=True, metric='cosine', verbose=True)
Mon Oct 10 17:18:00 2022 Construct fuzzy simplicial set
Mon Oct 10 17:18:00 2022 Finding Nearest Neighbors
Mon Oct 10 17:18:00 2022 Building RP forest with 40 trees
Mon Oct 10 17:19:16 2022 metric NN descent for 19 iterations
	 1  /  19
	 2  /  19
	 3  /  19
	 4  /  19
	 5  /  19
	 6  /  19
	 7  /  19
	 8  /  19
	 9  /  19
	 10  /  19
	Stopping threshold met -- exiting after 10 iterations
Mon Oct 10 17:27:28 2022 Finished Nearest Neighbor Search
Mon Oct 10 17:27:30 2022 Construct embedding
Epochs completed:   0%|            0/200 [00:00]
Mon Oct 10 17:32:02 2022 Finished embedding
Out[139]:
<matplotlib.colorbar.Colorbar at 0x7f270c279af0>
In [140]:
len(result.labels.unique())
Out[140]:
976

This method has identified 976 topics, which is quite a high number. It would probably be worth optimizing the hyperparameters of our UMAP and HDBSCAN steps to obtain a more manageable number of clusters.

Visualization with UMAP shows some clusters, but there are too many of them to tell whether this method separated the topics usefully.

3. Topic Identification¶

We will now use the BERTopic package to identify topics of dissatisfaction. Its pipeline is shown below :

BERTopic_pipeline.PNG

Since we keep mostly default settings (only the HDBSCAN model is customized below), BERTopic will perform embedding using SBERT, dimensionality reduction using UMAP and clustering with HDBSCAN. These are the same embedding, dimensionality reduction and clustering steps we performed manually above with alternate methods.

In [4]:
from bertopic import BERTopic
import hdbscan
%set_env TOKENIZERS_PARALLELISM=True

# -- Custom HDBSCAN
bertopic_params = {}
bertopic_params['hdbscan_model'] = hdbscan.HDBSCAN(min_cluster_size=10,
                                                   metric='euclidean',
                                                   cluster_selection_method='eom',
                                                   prediction_data=True,
                                                   core_dist_n_jobs=1)

topic_model = BERTopic(language="english", verbose=True, **bertopic_params)
topics, probs = topic_model.fit_transform(df["lemma_text"].to_list())
env: TOKENIZERS_PARALLELISM=True
Batches:   0%|          | 0/15188 [00:00<?, ?it/s]
2022-10-12 19:55:17,573 - BERTopic - Transformed documents to Embeddings
2022-10-12 20:01:26,665 - BERTopic - Reduced dimensionality
2022-10-12 20:03:22,968 - BERTopic - Clustered reduced embeddings
In [7]:
#Saving our topic_model
with open('Data/topic_model.pkl', 'wb') as file:
    dill.dump(topic_model, file)
#Saving the topics and probabilities
with open('Data/topics.pkl', 'wb') as file:
    dill.dump(topics, file)
with open('Data/probs.pkl', 'wb') as file:
    dill.dump(probs, file)
In [3]:
with open('Data/topic_model.pkl', 'rb') as file:
    topic_model = dill.load(file)
In [4]:
topic_model.visualize_barchart(top_n_topics=8)

We can see that some identified topics are relevant, like Topics 2 and 7, but other topics just regroup types of food.

We will now perform sentiment analysis on the words identified in each topic, in order to filter out neutral topics (i.e. those based on the type of food) and keep only topics carrying negative sentiment.

In [5]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
def score_topics_polarity(model, number):
    analyzer = SentimentIntensityAnalyzer()
    scores = []
    for i in range(number):
        word_list = []
        for t in topic_model.get_topic(i):
            word_list.append(t[0])
        words = ' '.join(word_list)
        scores.append({'Number': i, 'Polarity': analyzer.polarity_scores(words)["compound"]})
    return pd.DataFrame(scores)

topic_polarity = score_topics_polarity(topic_model, 50)

topic_polarity.head()
Out[5]:
Number Polarity
0 0 0.0000
1 1 0.0000
2 2 -0.0516
3 3 0.0000
4 4 -0.5719

As seen here, most topics do not contain any polarized words and have a neutral polarity. Other topics, like the previously identified Topic 2, have a negative polarity.

It is those topics we want to select, so let's filter our topics by negativity:

In [6]:
negative_topics = topic_polarity[topic_polarity.Polarity < 0]["Number"].to_list()

topic_model.visualize_barchart(negative_topics, n_words=10)

We can see that most of the identified topics are relevant. For easier display, we will plot word clouds for each of these topics :

In [147]:
# import the wordcloud library
from wordcloud import WordCloud
# Instantiate a new wordcloud.
wordcloud = WordCloud(random_state = 8,
        normalize_plurals = False,
        width = 600, height= 300,
        max_words = 300,
        stopwords = [])
# Apply the wordcloud to the text.

def generate_topic_wordclouds(topic_indexes):
    for i in topic_indexes:
        word_dict = {}
        for t in topic_model.get_topic(i):
            word_dict[t[0]]=int(t[1]*10000)
        wordcloud.generate_from_frequencies(word_dict)
        fig, ax = plt.subplots(1,1, figsize = (9,6))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.show()
        
generate_topic_wordclouds(negative_topics)
        
        

This concludes our topic analysis. The topics are ranked by frequency, so our client can grow its customer base by avoiding these common pitfalls.

Here are the top 3 identified topics of dissatisfaction :

  • Dirty bathrooms and floors, bad smells
  • Rude bartenders who ignore female customers, and the presence of drunk people
  • Overpriced food with small portions and poor service

4. Dynamic data retrieval using Yelp API¶

In this part of the project, we will show how our dataset can be updated in production by querying 200 additional reviews from the Yelp API.

In [113]:
import cred
import requests

api_key = cred.api_key

loc_list = ['USA', 'NY', 'LA', 'Washington', 'DC', 'SF', 'Chicago'] #Iterate over different locations to avoid duplicates
categories = 'Restaurants'
attributes = 'hot_and_new' #Retrieve only new businesses to avoid duplicates with "old" dataframe
SEARCH_LIMIT = 50 #Search limit

def retrieve_yelp_reviews(loc_list=loc_list, n_businesses=200, n_reviews=600):

    biz_url = 'https://api.yelp.com/v3/businesses/search'

    headers = {
            'Authorization': 'Bearer {}'.format(api_key),
        }

    responses = []
    for loc in loc_list:

        url_params = {
                        'location': loc + '+',
                        'categories': categories + '+',
                        'attributes': attributes + '+',
                        'limit': SEARCH_LIMIT
                    }
        response = requests.get(biz_url, headers=headers, params=url_params)

        #Checking for valid response code status
        if response.status_code == 200:

            responses += response.json()['businesses']
            
            if len(responses) >= n_businesses:
                break
                
    new_biz = pd.DataFrame.from_dict(responses)


    new_rev = []

    for i in new_biz["id"].to_list():
        url = "https://api.yelp.com/v3/businesses/" + str(i) + "/reviews"

        response = requests.get(url, headers=headers, params=None)

        if response.json():

            new_rev += response.json()['reviews']


    new_rev = pd.DataFrame(new_rev)
    
    return new_biz, new_rev
In [114]:
new_biz, new_rev = retrieve_yelp_reviews()

new_biz
Out[114]:
id alias name image_url is_closed url review_count categories rating coordinates transactions price location phone display_phone distance
0 wGl_DyNxSv8KUtYgiuLhmA bi-rite-creamery-san-francisco Bi-Rite Creamery https://s3-media3.fl.yelpcdn.com/bphoto/c5-w8m... False https://www.yelp.com/biz/bi-rite-creamery-san-... 9911 [{'alias': 'icecream', 'title': 'Ice Cream & F... 4.5 {'latitude': 37.761591, 'longitude': -122.425717} [delivery] $$ {'address1': '3692 18th St', 'address2': None,... +14156265600 (415) 626-5600 946.386739
1 lJAGnYzku5zSaLnQ_T6_GQ brendas-french-soul-food-san-francisco-6 Brenda's French Soul Food https://s3-media4.fl.yelpcdn.com/bphoto/VJ865E... False https://www.yelp.com/biz/brendas-french-soul-f... 11721 [{'alias': 'breakfast_brunch', 'title': 'Break... 4.0 {'latitude': 37.7829016035273, 'longitude': -1... [delivery] $$ {'address1': '652 Polk St', 'address2': '', 'a... +14153458100 (415) 345-8100 2885.389131
2 WavvLdfdP6g8aZTtbBQHTw gary-danko-san-francisco Gary Danko https://s3-media3.fl.yelpcdn.com/bphoto/eyYUz3... False https://www.yelp.com/biz/gary-danko-san-franci... 5748 [{'alias': 'newamerican', 'title': 'American (... 4.5 {'latitude': 37.80587, 'longitude': -122.42058} [] $$$$ {'address1': '800 N Point St', 'address2': '',... +14157492060 (415) 749-2060 5191.341803
3 ri7UUYmx21AgSpRsf4-9QA tartine-bakery-san-francisco-3 Tartine Bakery https://s3-media4.fl.yelpcdn.com/bphoto/QRbC0T... False https://www.yelp.com/biz/tartine-bakery-san-fr... 8530 [{'alias': 'bakeries', 'title': 'Bakeries'}, {... 4.0 {'latitude': 37.76131, 'longitude': -122.42431} [delivery] $$ {'address1': '600 Guerrero St', 'address2': ''... +14154872600 (415) 487-2600 1087.638933
4 76smcUUGRvq3k1MVPUXbnA mitchells-ice-cream-san-francisco Mitchells Ice Cream https://s3-media2.fl.yelpcdn.com/bphoto/f4lzrs... False https://www.yelp.com/biz/mitchells-ice-cream-s... 4530 [{'alias': 'icecream', 'title': 'Ice Cream & F... 4.5 {'latitude': 37.744221, 'longitude': -122.422791} [pickup, delivery] $ {'address1': '688 San Jose Ave', 'address2': '... +14156482300 (415) 648-2300 2209.260424
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
195 w11bYFeSydqdUpuyEJoXkg rachel-lake-cle-elum Rachel Lake https://s3-media1.fl.yelpcdn.com/bphoto/c2owVq... False https://www.yelp.com/biz/rachel-lake-cle-elum?... 10 [{'alias': 'hiking', 'title': 'Hiking'}, {'ali... 3.5 {'latitude': 47.19518, 'longitude': -120.93829} [] NaN {'address1': '', 'address2': '', 'address3': '... 9218.085989
196 BvJEM79soFlapfgHIngnpA keg-cellar-tavern-cle-elum Keg Cellar Tavern https://s3-media1.fl.yelpcdn.com/bphoto/w38HE2... False https://www.yelp.com/biz/keg-cellar-tavern-cle... 6 [{'alias': 'bars', 'title': 'Bars'}] 4.0 {'latitude': 47.1943740844727, 'longitude': -1... [] $$ {'address1': '112 N Pennsylvania Ave', 'addres... +15096742277 (509) 674-2277 9283.402199
197 4_m5m6ciDSEC7F3la3C_zQ gravity-coffee-cle-elum-cle-elum Gravity Coffee - Cle Elum https://s3-media1.fl.yelpcdn.com/bphoto/3jEjYK... False https://www.yelp.com/biz/gravity-coffee-cle-el... 9 [{'alias': 'coffee', 'title': 'Coffee & Tea'},... 4.0 {'latitude': 47.19498, 'longitude': -120.95508} [] NaN {'address1': '808 West Davis St', 'address2': ... +12534478740 (253) 447-8740 9886.047475
198 QY6Q8bwDfQ6PZagXWvVcvw kodiak-coffee-roslyn Kodiak Coffee https://s3-media3.fl.yelpcdn.com/bphoto/MzvoJo... False https://www.yelp.com/biz/kodiak-coffee-roslyn?... 11 [{'alias': 'coffee', 'title': 'Coffee & Tea'}] 3.5 {'latitude': 47.2078147610546, 'longitude': -1... [] NaN {'address1': '3172 WA-903', 'address2': '', 'a... +15096493398 (509) 649-3398 10128.139910
199 h2SS90lvvuHrupQ7xWVh1Q 56-degrees-cle-elum 56 Degrees https://s3-media2.fl.yelpcdn.com/bphoto/Eikx0k... False https://www.yelp.com/biz/56-degrees-cle-elum?a... 23 [{'alias': 'newamerican', 'title': 'American (... 2.5 {'latitude': 47.2086001553586, 'longitude': -1... [] $$ {'address1': '3600 Suncadia Trl', 'address2': ... +15096496474 (509) 649-6474 12329.965839

200 rows × 16 columns

Above is the list of businesses we collected; we gathered information on 200 businesses.

Note : It is necessary to include a location in the Yelp API business search. We have included a list of common US cities that we can iterate over but it would be more relevant to focus on the cities located in the vicinity of Good Dinner's Restaurants.

In [119]:
print("Number of reviews collected : {}".format(len(new_rev)))
new_rev.head()
Number of reviews collected : 600
Out[119]:
id url text rating time_created user
0 d9min1nLES_aJv-aEoLilQ https://www.yelp.com/biz/bi-rite-creamery-san-... Hands down my favorite ice cream spot in San F... 5 2022-09-05 15:24:19 {'id': 'SXJmAdpip5_vFFHPj7lwJQ', 'profile_url'...
1 yowp01_Ji6bnU5_OhRgxjA https://www.yelp.com/biz/bi-rite-creamery-san-... Really unique flavors -- I LOVED the ritual co... 4 2022-10-02 11:17:03 {'id': 'Jl1z3Wylzhb25XejTxPHtQ', 'profile_url'...
2 ljntTVjfL2BQPaiLMPZsGw https://www.yelp.com/biz/bi-rite-creamery-san-... We love Bi-Rite Creamery, especially for their... 4 2022-09-23 12:09:34 {'id': 'tctoFsg9byYvQ7OhdfzrvQ', 'profile_url'...
3 2fji5yUnZHTW-lZzoqaEqA https://www.yelp.com/biz/brendas-french-soul-f... Ah Brenda... you did not disappoint. \nFried C... 5 2022-10-09 09:33:09 {'id': 'vdyw_IXcFpfj6RrpUesPgw', 'profile_url'...
4 KfHGYGVl5PQqLTt4GgZB1w https://www.yelp.com/biz/brendas-french-soul-f... Service was horrible from our server, others a... 3 2022-10-10 05:28:31 {'id': 'WXoYGyHE5UrSC0YOd5e-7w', 'profile_url'...

Another limitation of the Yelp API is that only 3 reviews per business can be retrieved. The reviews returned are randomized, so rerunning the queries several times would gather different reviews.

Above is the list of the 600 retrieved reviews.
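Since repeated runs may return overlapping samples, successive batches could be concatenated and de-duplicated on the review `id` column. A sketch on two toy batches with invented ids and texts:

```python
import pandas as pd

# Two hypothetical batches from successive API runs, overlapping on one review
batch1 = pd.DataFrame({"id": ["a1", "b2"], "text": ["Great!", "Too salty."]})
batch2 = pd.DataFrame({"id": ["b2", "c3"], "text": ["Too salty.", "Slow service."]})

all_rev = (pd.concat([batch1, batch2], ignore_index=True)
             .drop_duplicates(subset="id")
             .reset_index(drop=True))
print(all_rev)
```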

Now it would be interesting to keep only negative reviews, i.e. to perform the same filtering as in our preprocessing (keeping only the reviews with 1 or 2 stars and a negative polarity).

In [120]:
new_rev["polarity"] = new_rev["text"].apply(analyze_polarity)
new_rev["lemma_text"] = new_rev["text"].apply(lemmatize)

#Keeping only negative reviews
new_df = new_rev.loc[(new_rev.polarity < 0) & (new_rev.rating <= 2)].copy()

#Updating the topics with the new data
#topic_model.update_topics(new_revs["lemma_text"].to_list())

print(new_df.shape)
#Only 14 negative reviews out of the 600 collected!
new_df.head()
(14, 8)
Out[120]:
id url text rating time_created user polarity lemma_text
100 tQN6XoPK3zXheBirOUYP_w https://www.yelp.com/biz/limoncello-san-franci... The owner of this business is absolutely insan... 1 2022-09-05 13:55:32 {'id': 'AiCTjZiyaZ8bx-j_mqzdbg', 'profile_url'... -0.201133 owner business absolutely insane clue customer...
106 bbwo1l3OZCNoQkpABvj8EA https://www.yelp.com/biz/arizmendi-bakery-san-... I was craving some sweets and I went to this s... 2 2022-10-06 17:11:44 {'id': '5g969cG9I994x6OLGiP0SQ', 'profile_url'... -0.003343 crave sweet go store \n ask chocolate thing \n...
196 si-G-TWKkyCO4LvTSWRgsA https://www.yelp.com/biz/peter-luger-brooklyn-... [A racist establishment- do not frequent] \nI ... 1 2022-10-04 20:37:18 {'id': 'AMQofGG8AmqE6BW79gqFbQ', 'profile_url'... -0.153680 racist establishment- frequent \n rarely write...
352 41A5RZNPQ-MIdCGLWg1Llg https://www.yelp.com/biz/hae-jang-chon-los-ang... good food but bad service,\nwe felt rushed and... 2 2022-10-06 15:24:36 {'id': 'I5owhnCBGBsEX1FucVli6g', 'profile_url'... -0.160700 good food bad service \n feel rush want contro...
407 PB4F4pBQnDZT9dmLEgSC5A https://www.yelp.com/biz/daves-hot-chicken-los... From hero's to franchise owners. \nQuality at ... 2 2022-09-04 16:19:34 {'id': 'kYWig-IQj7noJ7F3-h1G9A', 'profile_url'... -0.014557 hero franchise owner \n quality location drast...

Out of these 600 reviews, only 14 are negative, so we would need to run many more API queries to significantly impact our BERTopic algorithm.
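At this observed rate (14 negative reviews out of 600), we can roughly estimate how many reviews we would need to fetch to reach a given number of negative samples. A back-of-the-envelope sketch (the target of 500 is an arbitrary illustration):

```python
import math

negative_found, reviews_fetched = 14, 600
negative_rate = negative_found / reviews_fetched  # about 2.3%

def reviews_needed(target_negatives, rate):
    """Expected number of reviews to fetch to collect a target count of negatives."""
    return math.ceil(target_negatives / rate)

# e.g. ~21429 reviews, i.e. ~7143 businesses at 3 reviews each, for 500 negatives
print(reviews_needed(500, negative_rate))
```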

We can now save the complete list of retrieved businesses and reviews for further use.

In [121]:
new_rev.to_csv("new_rev.csv")
new_biz.to_csv("new_biz.csv")

Now that we have completed the Yelp review data processing, we will investigate the photo database.

Our main goal is to classify pictures based on the Yelp labels.

5 labels have been identified :

  • Food
  • Drink
  • Inside
  • Outside
  • Menu
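Before modelling, it is also worth checking how these 5 labels are distributed, since a strong class imbalance would affect a classifier. A quick sketch on a mock frame (the column name follows photos.json; the values are illustrative only):

```python
import pandas as pd

# Mock sample mirroring the structure of the photos dataframe
photos_sample = pd.DataFrame({
    "label": ["food", "food", "inside", "drink", "food", "menu", "outside"]
})

# Count the occurrences of each label
counts = photos_sample["label"].value_counts()
print(counts.idxmax())  # "food" dominates this mock sample
```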

5. Image Data Preprocessing¶

We will first load the image paths so that we can retrieve them later on.

5.1 Loading our image database¶

In [7]:
from skimage import io
import os
import glob

data_path = "Data/Photos"

#Retrieving photo names
photos_path = os.path.join(data_path, '*')
photos_path = glob.glob(photos_path)

Let's visualize a random image :

In [227]:
#Sampling a random image
image = io.imread(photos_path[5])

#Plotting the image
i, (im1) = plt.subplots(1)
i.set_figwidth(15)
im1.imshow(image)
plt.grid(None)
plt.show()

It works! Now we can turn our photo.json file into a Dataframe to retrieve information about these photos, most importantly their labels.

In [8]:
photos = pd.read_json("Data/photos.json", lines=True)

photos.head()
Out[8]:
photo_id business_id caption label
0 zsvj7vloL4L5jhYyPIuVwg Nk-SJhPlDBkAZvfsADtccA Nice rock artwork everywhere and craploads of ... inside
1 HCUdRJHHm_e0OCTlZetGLg yVZtL5MmrpiivyCIrVkGgA outside
2 vkr8T0scuJmGVvN2HJelEA _ab50qdWOk0DdB6XOrBitw oyster shooter drink
3 pve7D6NUrafHW3EAORubyw SZU9c8V2GuREDN5KgyHFJw Shrimp scampi food
4 H52Er-uBg6rNrHcReWTD2w Gzur0f0XMkrVxIwYJvOt2g food

We will also add the path to our photos in the dataframe :

In [9]:
def add_path(x):
    return "Data/Photos/" + str(x)+".jpg"
    
photos["photo_path"] = photos["photo_id"].apply(add_path)

photos.head()
Out[9]:
photo_id business_id caption label photo_path
0 zsvj7vloL4L5jhYyPIuVwg Nk-SJhPlDBkAZvfsADtccA Nice rock artwork everywhere and craploads of ... inside Data/Photos/zsvj7vloL4L5jhYyPIuVwg.jpg
1 HCUdRJHHm_e0OCTlZetGLg yVZtL5MmrpiivyCIrVkGgA outside Data/Photos/HCUdRJHHm_e0OCTlZetGLg.jpg
2 vkr8T0scuJmGVvN2HJelEA _ab50qdWOk0DdB6XOrBitw oyster shooter drink Data/Photos/vkr8T0scuJmGVvN2HJelEA.jpg
3 pve7D6NUrafHW3EAORubyw SZU9c8V2GuREDN5KgyHFJw Shrimp scampi food Data/Photos/pve7D6NUrafHW3EAORubyw.jpg
4 H52Er-uBg6rNrHcReWTD2w Gzur0f0XMkrVxIwYJvOt2g food Data/Photos/H52Er-uBg6rNrHcReWTD2w.jpg

Now, just as with the reviews, we are only interested in photos from restaurants, so we will filter our photos database using the business list previously identified.

In [16]:
#Joining on our restaurant business table to keep only restaurant photos
print(len(photos))
photos = pd.merge(photos, business, on="business_id", how="inner")
print(len(photos))
199994
170484

This reduces the size of our database from about 200k samples to 170k samples (15% reduction).

We also need to check the validity of our photo files and verify that all of the photos referenced in our database are actually present in our source folder. We will use OpenCV to read each image and flag those that fail to load as corrupted.

After that, we will remove corrupted images from our database.

In [11]:
#After running some iterations, we have realized that some images are corrupted
#This function will verify the path of each image using cv2
import cv2
def verify_path(x):
    img = cv2.imread(x)
    if img is None:
        return np.nan
    else:
        return x
    
photos["photo_path"] = photos["photo_path"].apply(verify_path)
libpng warning: iCCP: known incorrect sRGB profile
In [18]:
photos = photos[photos.photo_path.notna()]
#5 occurrences were removed

import dill
with open('Data/photos.pkl', 'wb') as file:
    dill.dump(photos, file)

This has removed 5 corrupted or missing photos from our database.

Now that our photo database has been cleaned, we are ready to start preprocessing.

5.2 Basic Image preprocessing steps¶

We will now apply preprocessing steps to our images. Instead of applying these functions to all our images, which would be very time-consuming, we apply them to a single batch of images.

We will then reimplement the preprocessing with Tensorflow and Keras in order to automate it.

5.2.1 Greyscaling¶

Greyscaling can be a useful preprocessing step: it reduces the number of channels from 3 to 1, which shrinks the image size and smoothes out results. The trade-off is that our algorithm would then be unable to differentiate between colors, which could be a problem for some labels.

In [228]:
#Testing manual greyscaling
from cucim.skimage.color import rgb2gray
import cupy as cp
img_grey = rgb2gray(cp.array(image))
plt.imshow(cp.asnumpy(img_grey))
Out[228]:
<matplotlib.image.AxesImage at 0x7f28b3695130>
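For machines without a CUDA-capable GPU, the same conversion is easy to reproduce on CPU: rgb2gray is simply a luminance-weighted average of the three channels. A NumPy sketch (the weights below match skimage's ITU-R BT.709 luma coefficients; the random image is a stand-in for a loaded photo):

```python
import numpy as np

def to_grey(rgb):
    """Luminance-weighted channel average (ITU-R BT.709 weights, as in skimage's rgb2gray)."""
    weights = np.array([0.2125, 0.7154, 0.0721])
    return rgb @ weights  # drops the channel axis

rgb = np.random.rand(4, 4, 3)  # synthetic RGB image in [0, 1]
grey = to_grey(rgb)
print(grey.shape)  # (4, 4)
```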

5.2.2 Image normalization¶

We will now use the cupy package to perform a basic normalization of our image :

In [231]:
#Image normalization with cupy
img_norm = (img_grey - cp.min(img_grey))/ (cp.max(img_grey) - cp.min(img_grey))
plt.imshow(cp.asnumpy(img_norm))
Out[231]:
<matplotlib.image.AxesImage at 0x7f28b3f47fd0>

5.3 Tensorflow and Keras Preprocessing¶

We will now switch from these basic preprocessing steps to Tensorflow and Keras, in order to reduce RAM usage and to more easily reproduce the preprocessing at scale.

5.3.1 Loading our database on Tensorflow¶

The first thing we need to do is load our database into Tensorflow.

We will perform the greyscale conversion when loading the images in order to reduce overhead. The load function also automatically resizes each image to the desired scale, here 160x160.

In [112]:
#Using tensorflow and keras
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds

BATCH_SIZE = 32
IMG_SIZE = 160

test = photos[photos.label.notna()].head(500) #Creating a sample of our dataset to test deployment

x = test["photo_path"]
y = test["label"]

def load(file_path):

    img = tf.io.read_file(file_path)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, size=(IMG_SIZE, IMG_SIZE)) #Resize
    img = tf.image.rgb_to_grayscale(img) #Converts to greyscale
    return img

photo_ds = tf.data.Dataset.from_tensor_slices((x,y)).map(lambda x,y : (load(x), y))

next(iter(photo_ds))
['inside' 'food' 'drink' 'outside' 'menu']
Out[112]:
(<tf.Tensor: shape=(160, 160, 1), dtype=float32, numpy=
 array([[[0.08659334],
         [0.08659334],
         [0.08659334],
         ...,
         [0.255501  ],
         [0.25374258],
         [0.26273412]],
 
        [[0.08659334],
         [0.08659334],
         [0.08659334],
         ...,
         [0.26046377],
         [0.26022068],
         [0.2752608 ]],
 
        [[0.08571423],
         [0.08571423],
         [0.08571423],
         ...,
         [0.26663992],
         [0.2686267 ],
         [0.27975687]],
 
        ...,
 
        [[0.27996668],
         [0.3602888 ],
         [0.35526276],
         ...,
         [0.02429049],
         [0.03628377],
         [0.03729754]],
 
        [[0.3682305 ],
         [0.4474603 ],
         [0.58711964],
         ...,
         [0.08658904],
         [0.06012097],
         [0.0773527 ]],
 
        [[0.39774716],
         [0.3790933 ],
         [0.35402676],
         ...,
         [0.05249134],
         [0.06983256],
         [0.07253755]]], dtype=float32)>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'inside'>)
In [8]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_for_performance(ds):
    ds = ds.cache()
    ds = ds.shuffle(buffer_size=1000)
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds

photo_ds = configure_for_performance(photo_ds)

Now let's verify that the photos are correctly loaded into the tensorflow database :

In [23]:
image_batch, label_batch = next(iter(photo_ds))

plt.figure(figsize=(10, 10))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image_batch[i].numpy())
    plt.title(label_batch[i].numpy().decode('UTF-8'))
    plt.axis("off")

Now that our photos are correctly loaded, let's continue performing some preprocessing steps.

5.3.2 Image standardization¶

We will now standardize our images by Rescaling them.

Let's look at our results :

In [25]:
#Standardizing our images
normalization_layer = tf.keras.layers.Rescaling(1./255)
normalized_ds = photo_ds.map(lambda x,y : (normalization_layer(x), y))
image_batch, label_batch = next(iter(normalized_ds))

#Visualizing 9 standardized images
plt.figure(figsize=(10, 10))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image_batch[i].numpy())
    plt.title(label_batch[i].numpy().decode('UTF-8'))
    plt.axis("off")

This has little visible effect, but it puts pixel values in a range suitable for classification or feature extraction algorithms.

We will now perform histogram equalization of our images.

5.3.3 Histogram Equalization¶

There is no built-in Tensorflow function to perform histogram equalization, so I had to implement a custom Keras layer.

Let's look at our results :

In [7]:
#Defining custom Keras layers

from skimage import exposure
from tensorflow.keras.layers import Layer, Input, Conv2D
from tensorflow.keras.models import Model


def equalize(img):
    for channel in range(img.shape[2]):  # equalizing each channel
        img[:, :, channel] = exposure.equalize_hist(img[:, :, channel])
    return img.astype(np.float32)


def preprocess_input(img):
    x = tf.numpy_function(equalize,
                   [img],
                   'float32',
                   name='histogram_equalization')
    return tf.cast(x, tf.float32)


class EqualizingLayer(Layer):
    def __init__(self,  **kwargs):
        self.trainable = False
        super(EqualizingLayer, self).__init__(**kwargs)
        
    def compute_output_shape(self, input_shape):
        return ((input_shape[0], input_shape[1], input_shape[2], input_shape[3]))
    
    def build(self, input_shape):
        super().build(input_shape)

    def call(self, x):
        res = tf.map_fn(preprocess_input, x)
        res.set_shape(self.compute_output_shape(x.get_shape())) #No change to the shape
        return res
In [109]:
equalizing_layer = EqualizingLayer()
equalized_ds = normalized_ds.map(lambda x,y : (equalizing_layer(x), y))
image_batch, label_batch = next(iter(equalized_ds))

#Visualizing 9 standardized images
plt.figure(figsize=(10, 10))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image_batch[i].numpy())
    plt.title(label_batch[i].numpy().decode('UTF-8'))
    plt.axis("off")

It is clear that histogram equalization has improved our images: the brightness has been smoothed out, revealing parts of the photos that were previously too dark to see.

To improve our algorithms, we will now perform data augmentation on our images, by applying random operations and adding Gaussian noise. The main purpose of data augmentation is to prevent our algorithm from overfitting. It is also useful for creating new samples when the initial database is too small, which is not the case here.

5.3.4 Image data augmentation¶

We will perform 4 data augmentation operations here :

  • Random Flip (horizontal and vertical)
  • Random Rotation of the image
  • Random zoom
  • Performing Gaussian smoothing of our images
In [110]:
#Performing data augmentation
from tensorflow.keras import layers


data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.1),
    layers.GaussianNoise(1e-4),
    
])

augmented_ds = equalized_ds.map(lambda x,y: (data_augmentation(x, training=True), y))

image_batch, label_batch = next(iter(augmented_ds))

#Visualizing 9 standardized images
plt.figure(figsize=(10, 10))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image_batch[i].numpy())
    plt.title(label_batch[i].numpy().decode('UTF-8'))
    plt.axis("off")
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.

We can clearly see that our data augmentation steps have worked as some images have been flipped.

In [ ]:
#Summarizing our preprocessing steps:
pre_processing = tf.keras.Sequential([
    layers.Rescaling(1./255),
    EqualizingLayer(),
    layers.GaussianNoise(1e-4),
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.1),
])

#Converting to greyscale and correcting image_size has been done on dataset generation

Now that we have completed preprocessing, we are still unable to visualize the shape of our image database because our images are still 160 x 160 x 3 tensors.

In order to perform feature extraction, we will use Transfer Learning and use the Google MobileNetV2 algorithm to extract important features from our images.

6. Feature selection using Transfer Learning¶

6.1 Full dataset loading¶

We will now load the full dataset to Tensorflow :

In [6]:
with open('Data/photos.pkl', 'rb') as file:
    photos = dill.load(file)
In [18]:
#Using tensorflow and keras
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds
from sklearn.model_selection import train_test_split
from category_encoders.ordinal import OrdinalEncoder

TEST_SIZE = 0.1

df = photos[photos.label.notna()][["photo_path","label"]] #Loading full dataset with no missing labels

#Defining label mapping for further use
label_mapping = [{'col': 'label', 'mapping':{'food': 0, 'drink':1, 'inside':2, 'outside': 3, 'menu': 4}}]

df = OrdinalEncoder(cols="label", mapping=label_mapping).fit_transform(df)

train, test = train_test_split(df, test_size=TEST_SIZE)
train, val = train_test_split(train, test_size =TEST_SIZE/(1-TEST_SIZE))

X_train, X_test, X_val = train["photo_path"], test["photo_path"], val["photo_path"]
y_train, y_test, y_val = train["label"], test["label"], val["label"]

IMG_SIZE = 160
BATCH_SIZE = 32

def load(file_path):

    img = tf.io.read_file(file_path)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, size=(IMG_SIZE, IMG_SIZE)) #Resize
    #img = tf.image.rgb_to_grayscale(img) #We do not perform grayscale conversion for this model
    return img

train_ds = tf.data.Dataset.from_tensor_slices((X_train,y_train)).map(lambda x,y : (load(x), y))
test_ds = tf.data.Dataset.from_tensor_slices((X_test,y_test)).map(lambda x,y : (load(x), y))
val_ds = tf.data.Dataset.from_tensor_slices((X_val,y_val)).map(lambda x,y : (load(x), y))

next(iter(train_ds))
Out[18]:
(<tf.Tensor: shape=(160, 160, 3), dtype=float32, numpy=
 array([[[0.16470589, 0.16470589, 0.13333334],
         [0.16078432, 0.16078432, 0.12941177],
         [0.15686275, 0.15686275, 0.1254902 ],
         ...,
         [0.10980393, 0.10196079, 0.05147059],
         [0.10196079, 0.10196079, 0.07058824],
         [0.10588236, 0.10588236, 0.07450981]],
 
        [[0.16470589, 0.16470589, 0.13333334],
         [0.16078432, 0.16078432, 0.12941177],
         [0.15686275, 0.15686275, 0.1254902 ],
         ...,
         [0.11078432, 0.10294119, 0.05245098],
         [0.10588236, 0.10588236, 0.07450981],
         [0.10980393, 0.10980393, 0.07843138]],
 
        [[0.16470589, 0.16470589, 0.13333334],
         [0.16078432, 0.16078432, 0.12941177],
         [0.15686275, 0.15686275, 0.1254902 ],
         ...,
         [0.11629903, 0.10330883, 0.05539216],
         [0.11433824, 0.10551471, 0.07120098],
         [0.11752452, 0.10870099, 0.07438726]],
 
        ...,
 
        [[0.02267157, 0.01090686, 0.        ],
         [0.02843137, 0.01666667, 0.        ],
         [0.03235294, 0.02058824, 0.00098039],
         ...,
         [0.2927696 , 0.31850493, 0.34203434],
         [0.2824755 , 0.3139706 , 0.32598042],
         [0.37426472, 0.39828435, 0.41004905]],
 
        [[0.03186275, 0.02009804, 0.00110294],
         [0.03529412, 0.02352941, 0.00392157],
         [0.03235294, 0.02058824, 0.00098039],
         ...,
         [0.2507353 , 0.27169117, 0.28468138],
         [0.37622553, 0.40551475, 0.42598042],
         [0.38676476, 0.41078436, 0.42254907]],
 
        [[0.02855392, 0.01678922, 0.        ],
         [0.03394608, 0.02218137, 0.00257353],
         [0.03627451, 0.02450981, 0.00490196],
         ...,
         [0.2839461 , 0.30012256, 0.31188726],
         [0.41372553, 0.44901964, 0.46862748],
         [0.36164218, 0.38566178, 0.3974265 ]]], dtype=float32)>,
 <tf.Tensor: shape=(), dtype=int64, numpy=0>)
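A note on the double train_test_split above: the second split uses TEST_SIZE/(1-TEST_SIZE) so that the validation set also amounts to 10% of the original data, yielding an 80/10/10 train/validation/test split. A quick arithmetic check:

```python
TEST_SIZE = 0.1

test_frac = TEST_SIZE                                  # 10% test, from the first split
remaining = 1 - TEST_SIZE                              # 90% left after the first split
val_frac = remaining * (TEST_SIZE / (1 - TEST_SIZE))   # 10% of the original data
train_frac = remaining - val_frac                      # 80% train

print(round(train_frac, 4), round(val_frac, 4), round(test_frac, 4))
```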

It is not necessary to perform greyscale conversion for this model, so we will slightly modify our preprocessing pipeline to accommodate this.

We will now load the model and its preprocessing function that will be used in our preprocessing pipeline.

In [9]:
#We will now perform feature selection using the mobilenetV2 model.
#We have to use the model's preprocessing function, which just adds a Rescaling Layer
preprocess_input = tf.keras.applications.mobilenet_v2.preprocess_input

IMG_SHAPE = (IMG_SIZE,IMG_SIZE) + (3,)
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')

base_model.trainable = False
In [19]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_for_performance(ds):
    ds = ds.cache()
    ds = ds.batch(BATCH_SIZE)
    #ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds

# train_ds = configure_for_performance(train_ds)
# test_ds = configure_for_performance(test_ds)
# val_ds = configure_for_performance(val_ds)
In [11]:
#Defining custom Keras layers

from skimage import exposure
from tensorflow.keras.layers import Layer, Input, Conv2D
from tensorflow.keras.models import Model


def equalize(img):
    for channel in range(img.shape[2]):  # equalizing each channel
        img[:, :, channel] = exposure.equalize_hist(img[:, :, channel])
    return img.astype(np.float32)


def preprocess_input(img):
    x = tf.numpy_function(equalize,
                   [img],
                   'float32',
                   name='histogram_equalization')
    return tf.cast(x, tf.float32)


class EqualizingLayer(Layer):
    def __init__(self,  **kwargs):
        self.trainable = False
        super(EqualizingLayer, self).__init__(**kwargs)
        
    def compute_output_shape(self, input_shape):
        return ((input_shape[0], input_shape[1], input_shape[2], input_shape[3]))
    
    def build(self, input_shape):
        super().build(input_shape)

    def call(self, x):
        res = tf.map_fn(preprocess_input, x)
        res.set_shape(self.compute_output_shape(x.get_shape())) #No change to the shape
        return res
In [20]:
from tensorflow.keras import layers
import warnings

#Defining our preprocessing steps
pre_process = tf.keras.Sequential([
    layers.Rescaling(1./127.5, offset=-1), #Rescaling to [-1, 1], as MobileNetV2's preprocessing requires
    EqualizingLayer(), #Our custom Histogram equalization method
])

data_augmentation = tf.keras.Sequential([
    layers.GaussianNoise(1e-4), #Smoothing our image with gaussian noise
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.1),
])

categorical_data_encoding = layers.CategoryEncoding(num_tokens=5, output_mode="one_hot")

def prepare(ds, shuffle=False, augment=False):
#     ds = ds.map(lambda x, y: (pre_process(x), y)) #Integrated to model
    
    #ds = ds.map(lambda x, y: (x, categorical_data_encoding(y))) #Performing one hot encoding on target variable
    
    if shuffle:
        ds.shuffle(1000)
    
    #Use augmentation only on the training set
    
#     if augment:
#         ds = ds.map(lambda x, y: (data_augmentation(x, training=True), y)) #Integrated to model
        
    return ds


with warnings.catch_warnings(): #This code produces many warnings because of a current tensorflow issue with data aug. layers
    warnings.simplefilter("ignore")
    train_ds = prepare(train_ds, shuffle=True, augment=True)
    test_ds = prepare(test_ds)
    val_ds = prepare(val_ds)
    
train_ds = configure_for_performance(train_ds)
test_ds = configure_for_performance(test_ds)
val_ds = configure_for_performance(val_ds)

6.2 Feature Extraction using MobileNetV2 :¶

We also automate the preprocessing step by including it in a function that we can then apply to our 3 datasets :

  • Training set
  • Testing set
  • Validation set

Let's look at the shape of our data after it has been processed by our base model :

In [34]:
image_batch, label_batch = next(iter(train_ds))
feature_batch = base_model(image_batch)
print(feature_batch.shape)
#Our base model converts our feature batch into a 5*5*1280 block of features
2022-10-12 08:18:35.849750: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
(32, 5, 5, 1280)

The output is still multidimensional; to be able to perform data visualization, we need to reduce it to one feature vector per image.

We will add a Global Average Pooling 2D layer to our model, which averages each 5 x 5 feature map down to a single value, producing a 1280-dimensional vector per image.
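GlobalAveragePooling2D simply averages each of the 1280 feature maps over its 5 x 5 spatial grid. A NumPy sketch of the equivalent operation, with shapes chosen to match our feature batch (the random array stands in for base_model's output):

```python
import numpy as np

# A mock feature batch shaped like base_model's output: (batch, 5, 5, 1280)
features = np.random.rand(32, 5, 5, 1280).astype(np.float32)

# Global average pooling == mean over the two spatial axes
pooled = features.mean(axis=(1, 2))
print(pooled.shape)  # (32, 1280)
```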

In [32]:
#To generate predictions, we need to turn each 5 X 5 X 1280 feature block into a 1280-feature vector
#We will apply a GlobalAveragePooling2D layer to average out the 5 X 5 spatial dimensions

global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
feature_batch_average = global_average_layer(feature_batch)
print(feature_batch_average.shape)
(32, 1280)

We can see that each image now has 1280 features. (The 32 is our batch size.)

We will automate this process by building a model with this additional layer and applying it to our training set.

In [13]:
#We will modify our model to add the GlobalPooling Layer
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()

base_model.trainable = False

inputs = tf.keras.Input(shape=(IMG_SIZE,IMG_SIZE,3))
x = base_model(inputs, training=False)
outputs = global_average_layer(x)
flatten_model = tf.keras.Model(inputs, outputs)
flatten_model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_2 (InputLayer)        [(None, 160, 160, 3)]     0         
                                                                 
 mobilenetv2_1.00_160 (Funct  (None, 5, 5, 1280)       2257984   
 ional)                                                          
                                                                 
 global_average_pooling2d (G  (None, 1280)             0         
 lobalAveragePooling2D)                                          
                                                                 
=================================================================
Total params: 2,257,984
Trainable params: 0
Non-trainable params: 2,257,984
_________________________________________________________________
In [17]:
train_ds_red = flatten_model.predict(train_ds)
train_ds_red.shape
2022-10-13 06:52:22.144458: W tensorflow/core/lib/png/png_io.cc:88] PNG warning: iCCP: known incorrect sRGB profile
4263/4263 [==============================] - 1027s 241ms/step
Out[17]:
(136386, 1280)
In [18]:
with open("Data/train_reduced.pkl", "wb") as file:
    dill.dump(train_ds_red, file)
In [7]:
with open("Data/train_reduced.pkl", "rb") as file:
    train_ds_red = dill.load(file)

We can see that this has worked: our training dataset has been transformed into a 136386 x 1280 numpy array.

We now need to recover the labels from the Tensorflow dataset :

In [21]:
train_labels = np.concatenate([y for x, y in train_ds], axis=0)

train_labels.shape
Out[21]:
(136386,)
In [22]:
with open("Data/train_labels.pkl", "wb") as file:
    dill.dump(train_labels, file)
In [8]:
with open("Data/train_labels.pkl", "rb") as file:
    train_labels = dill.load(file)

Now that we have recovered the labels, we need to encode these categorical labels as integers using a LabelEncoder.

This encoding will be used when visualizing the clusters.

In [ ]:
# from sklearn.preprocessing import LabelEncoder

# train_labels = LabelEncoder().fit_transform(train_labels)

# train_labels[1]
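As a minimal illustration of what the commented-out cell above would do, here is scikit-learn's LabelEncoder applied to a hypothetical handful of our five photo classes (not the actual `train_labels` array):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical subset of our photo labels
sample_labels = ["food", "drink", "inside", "outside", "menu", "food"]

le = LabelEncoder()
encoded = le.fit_transform(sample_labels)

# Classes are sorted alphabetically before integers are assigned
print(list(le.classes_))  # ['drink', 'food', 'inside', 'menu', 'outside']
print(list(encoded))      # [1, 0, 2, 4, 3, 1]
```

Note that the integer assigned to each class depends on alphabetical order, not on order of first appearance.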

We are now ready to apply dimensionality reduction to our dataset.

6.3 Dimensionality Reduction and Visualization with UMAP¶

We will apply UMAP to reduce our dataset to 3 components and visualize it :

In [9]:
import umap.umap_ as umap

umap_viz = umap.UMAP(n_neighbors=15, n_components=3, verbose=True, metric='cosine').fit_transform(train_ds_red)
result = pd.DataFrame(umap_viz, columns=['x', 'y','z'])
result['labels'] = train_labels
UMAP(angular_rp_forest=True, metric='cosine', n_components=3, verbose=True)
Thu Oct 20 14:13:47 2022 Construct fuzzy simplicial set
Thu Oct 20 14:13:48 2022 Finding Nearest Neighbors
Thu Oct 20 14:13:48 2022 Building RP forest with 23 trees
Thu Oct 20 14:13:55 2022 NN descent for 17 iterations
	 1  /  17
	 2  /  17
	 3  /  17
	 4  /  17
	 5  /  17
	 6  /  17
	 7  /  17
	 8  /  17
	Stopping threshold met -- exiting after 8 iterations
Thu Oct 20 14:14:11 2022 Finished Nearest Neighbor Search
Thu Oct 20 14:14:13 2022 Construct embedding
Epochs completed:   0%|            0/200 [00:00]
Thu Oct 20 14:14:57 2022 Finished embedding
In [10]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

#Retrieving label
label_mapping = [{'col': 'label', 'mapping':{'food': 0, 'drink':1, 'inside':2, 'outside': 3, 'menu': 4}}]
mapping = label_mapping[0]['mapping']

fig = make_subplots()
fig.update_layout(height=600, width=1000)

colors = {0: 'blue', 1: 'red', 2: 'green', 3: 'yellow', 4: 'brown'}


for i in range(5):
    subset = result[result.labels == i]
    fig.add_trace(go.Scatter3d(
    x=subset['x'],
    y=subset['y'],
    z=subset['z'],
    mode='markers',
    name = [k for k,v in mapping.items() if v == i][0].capitalize(),
    marker=dict(
        size=3,
        color=colors[i],
        opacity=0.7
    )
    ))
fig.update_layout(title="3D Visualization of our Image Data",
          legend_title = 'Label',
          font = dict(
          size=18),
          showlegend=True)
fig.show()
In [69]:
# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
plt.scatter(result.x, result.y, c=result.labels, s=0.9,cmap='hsv_r')
plt.colorbar()
Out[69]:
<matplotlib.colorbar.Colorbar at 0x7fa9d8f3ca00>

We can see that this process has not cleanly separated the clusters. A likely reason is that reducing a 1280-feature dataset to just 3 features discards too much information.
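One way to put a number on how well (or poorly) an embedding separates the classes is the silhouette score. A minimal sketch on synthetic 2-D points (not our actual UMAP embedding):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)

# Two well-separated synthetic clusters -> score close to 1
tight = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
# Two heavily overlapping clusters -> score close to 0
loose = np.vstack([rng.normal(0, 3.0, (50, 2)), rng.normal(1, 3.0, (50, 2))])

print(silhouette_score(tight, labels))  # high
print(silhouette_score(loose, labels))  # near zero
```

Applied to `result[['x', 'y', 'z']]` against `train_labels`, a low score would confirm the visual impression of overlapping clusters.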

We will now use our MobileNetV2 model to predict the labels of our photos.

7. Predicting the classes using Transfer Learning¶

First, we need to add layers to our model so that it can actually predict our five classes.

7.1 Creating a new model¶

In [16]:
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()

prediction_layer = tf.keras.Sequential([
        layers.Dense(1024, input_dim=1280),
        layers.LeakyReLU(),
        layers.Dense(512),
        layers.LeakyReLU(),
        layers.Dense(256),
        layers.LeakyReLU(),
        layers.Dropout(.3),
        layers.Dense(128),
        layers.LeakyReLU(),
        #layers.Dropout(.2),
        layers.Dense(5),
        layers.Softmax()
    ])

inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = pre_process(inputs)
x = data_augmentation(x)
x = base_model(x, training=False)
x = global_average_layer(x)
outputs = prediction_layer(x)
model = tf.keras.Model(inputs, outputs)

Then we compile the new model and display its summary :

In [17]:
model.compile(optimizer='Adam',loss='categorical_crossentropy',metrics=['accuracy'])
# Categorical cross-entropy is used because this is a multi-class
# classification problem with one-hot encoded labels.
# Adam generally converges quickly with little tuning;
# other optimizers such as SGD could also be used.

model.summary()
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 160, 160, 3)]     0         
                                                                 
 sequential_3 (Sequential)   (None, 160, 160, 3)       0         
                                                                 
 sequential_4 (Sequential)   (None, 160, 160, 3)       0         
                                                                 
 mobilenetv2_1.00_160 (Funct  (None, 5, 5, 1280)       2257984   
 ional)                                                          
                                                                 
 global_average_pooling2d_1   (None, 1280)             0         
 (GlobalAveragePooling2D)                                        
                                                                 
 sequential_5 (Sequential)   (None, 5)                 2001413   
                                                                 
=================================================================
Total params: 4,259,397
Trainable params: 2,001,413
Non-trainable params: 2,257,984
_________________________________________________________________
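As a sanity check, the trainable parameter count reported in the summary can be reproduced from the layer widths of the prediction head (each Dense layer contributes weights plus biases; LeakyReLU, Dropout and Softmax add no parameters):

```python
# Layer widths of the prediction head: 1280-dim pooled features -> 5 classes
dims = [1280, 1024, 512, 256, 128, 5]

# Dense layer parameters = n_inputs * n_units (weights) + n_units (biases)
trainable = sum(n_in * n_out + n_out for n_in, n_out in zip(dims, dims[1:]))
print(trainable)  # 2001413, matching "Trainable params" above
```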

Now we evaluate our model on the validation dataset to see its initial performance :

In [114]:
loss0, accuracy0 = model.evaluate(val_ds)
533/533 [==============================] - 97s 179ms/step - loss: 1.5922 - accuracy: 0.2328
In [115]:
print("initial loss: {:.2f}".format(loss0))
print("initial accuracy: {:.2f}".format(accuracy0))
initial loss: 1.59
initial accuracy: 0.23

We can see that our untrained model has a very weak accuracy of 0.23, barely above the 0.20 expected from random guessing across five classes.

Let's now train our model and see how these results evolve :

7.2 Model training¶

For this phase, we freeze the base_model's weights by setting its trainable attribute to False, so only the newly created layers are actually fitted.

Because of my computer's memory limits, the model can only be trained for a few epochs before running into RAM issues.

To improve this further, we could run the model training on Azure.

In [ ]:
initial_epochs = 5

import gc
#Defining custom Call Back to prevent memory leak
class MyCustomCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        gc.enable()
        tf.keras.backend.clear_session() #Resets RAM usage after every EPOCH
        gc.collect()
        
mem_clear = MyCustomCallback()

with tf.device('/cpu:0'): #GPU slows down calculation time
    history = model.fit(train_ds, #No validation set is provided to prevent memory leak
                        epochs=initial_epochs,
                        callbacks=[mem_clear])
Epoch 1/5
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
4263/4263 [==============================] - 1148s 268ms/step - loss: 0.3441 - accuracy: 0.8816
Epoch 2/5
4263/4263 [==============================] - 1125s 264ms/step - loss: 0.2973 - accuracy: 0.8974
Epoch 3/5
4263/4263 [==============================] - 1126s 264ms/step - loss: 0.2792 - accuracy: 0.9033
Epoch 4/5
4263/4263 [==============================] - 1110s 260ms/step - loss: 0.2704 - accuracy: 0.9060
Epoch 5/5
 271/4263 [>.............................] - ETA: 17:21 - loss: 0.2794 - accuracy: 0.9057
In [ ]:
loss0, accuracy0 = model.evaluate(test_ds)

print("Final loss: {:.2f}".format(loss0))
print("Final accuracy: {:.2f}".format(accuracy0))
404/533 [=====================>........] - ETA: 25s - loss: 0.1852 - accuracy: 0.9404

Our final accuracy of 0.94 is very good for only 5 epochs of training.

We could increase the accuracy further by fine-tuning: unfreezing some of the top layers of the base model and training them on our dataset as well.
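A minimal sketch of such a fine-tuning step, using a small stand-in Sequential model here instead of MobileNetV2 to keep the example self-contained: unfreeze the model, re-freeze every layer below a cut-off index, then recompile with a much lower learning rate so the pretrained weights are only nudged.

```python
import tensorflow as tf

# Small stand-in for the pretrained base model (MobileNetV2 in the notebook)
base = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
])

base.trainable = True
fine_tune_at = 2  # unfreeze only layers from this index upward

# Freeze all layers below the fine-tuning cut-off
for layer in base.layers[:fine_tune_at]:
    layer.trainable = False

# Recompile with a low learning rate so the pretrained weights change slowly
base.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
             loss="categorical_crossentropy")

print([layer.trainable for layer in base.layers])  # [False, False, True]
```

For the real model, `fine_tune_at` would typically be chosen near the top of MobileNetV2's roughly 150 layers, and training would restart from the weights already fitted above.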